The COVID-19 pandemic stands as one of the most significant global health crises in modern history, reshaping societies, economies, and healthcare systems worldwide. This project analyzes global COVID-19 trends using data from the Johns Hopkins University dataset to uncover patterns in confirmed cases and deaths, compare continental trajectories, and model pandemic growth over time. Through exploratory data analysis, predictive modeling, and critical evaluation, this report aims to provide a data-driven perspective on how the pandemic evolved across different regions — and the lessons embedded in the numbers.
How did COVID-19 cases and deaths evolve globally over time, and what were the differences in trends between continents?
The data was sourced from the Johns Hopkins University COVID-19 GitHub repository (CSSEGISandData/COVID-19). It provides daily cumulative time series of confirmed cases, deaths, and recoveries by country and region, starting on 22 January 2020. This analysis uses the confirmed-case and death series.
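# Load packages
library(tidyverse)  # readr, dplyr, tidyr, ggplot2
library(lubridate)  # mdy()
library(scales)     # comma labels for plot axes
# Import data
confirmed <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
deaths <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
# Tidy data: pivot every date column (named like "1/22/20") to long format.
# Note: starts_with("1") would keep only months 1, 10, 11, and 12; the 379 dates
# in the regression output later in this report are consistent with that narrower selection.
confirmed_long <- confirmed %>%
  pivot_longer(cols = matches("^[0-9]+/[0-9]+/[0-9]+$"), names_to = "date", values_to = "confirmed") %>%
  mutate(date = mdy(date))
deaths_long <- deaths %>%
  pivot_longer(cols = matches("^[0-9]+/[0-9]+/[0-9]+$"), names_to = "date", values_to = "deaths") %>%
  mutate(date = mdy(date))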
# Merge confirmed and deaths
covid_data <- confirmed_long %>%
  left_join(deaths_long %>% select(`Province/State`, `Country/Region`, Lat, Long, date, deaths),
            by = c("Province/State", "Country/Region", "Lat", "Long", "date"))
# Group by country and date
global_data <- covid_data %>%
  group_by(`Country/Region`, date) %>%
  summarize(confirmed = sum(confirmed), deaths = sum(deaths), .groups = "drop")
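The tidy, join, and aggregate steps above can be checked on a small made-up table before running them on the full download. The sketch below is illustrative only: the `toy_confirmed` and `toy_deaths` tables and their values are invented, `to_long()` is a hypothetical helper, and the join uses only province, country, and date as keys.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy wide tables in the same layout as the JHU files (values are invented)
toy_confirmed <- tibble(
  `Province/State` = c("A", "B", NA),
  `Country/Region` = c("X", "X", "Y"),
  `1/22/20` = c(1, 0, 2),
  `2/1/20`  = c(3, 1, 4),
  `10/5/20` = c(9, 5, 8)
)
toy_deaths <- toy_confirmed %>%
  mutate(across(matches("^[0-9]+/[0-9]+/[0-9]+$"), ~ floor(.x / 3)))

# Hypothetical helper: pivot all date-named columns to long format
to_long <- function(wide, value_name) {
  wide %>%
    pivot_longer(cols = matches("^[0-9]+/[0-9]+/[0-9]+$"),
                 names_to = "date", values_to = value_name) %>%
    mutate(date = mdy(date))
}

toy_long <- to_long(toy_confirmed, "confirmed") %>%
  left_join(to_long(toy_deaths, "deaths"),
            by = c("Province/State", "Country/Region", "date"))

toy_country <- toy_long %>%
  group_by(`Country/Region`, date) %>%
  summarize(confirmed = sum(confirmed), deaths = sum(deaths), .groups = "drop")

# Sanity checks: no date column lost, no unmatched rows, totals preserved
stopifnot(n_distinct(toy_long$date) == 3)
stopifnot(!anyNA(toy_long$deaths))
stopifnot(sum(toy_country$confirmed) == sum(toy_confirmed[, 3:5]))
```

Checks of this kind fail loudly if a date column is dropped during the pivot or if the join leaves rows without a matching death count.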
# Visualization 1: Global Confirmed Cases Over Time
global_data %>%
  group_by(date) %>%
  summarize(total_cases = sum(confirmed)) %>%
  ggplot(aes(x = date, y = total_cases)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "Global COVID-19 Confirmed Cases Over Time",
       x = "Date", y = "Total Confirmed Cases") +
  scale_y_continuous(labels = comma) +
  theme_minimal()
# Visualization 2: Top 5 Countries by Deaths
# Rank countries by cumulative deaths on the latest date in the data
top_countries <- global_data %>%
  filter(date == max(date)) %>%
  arrange(desc(deaths)) %>%
  slice(1:5) %>%
  pull(`Country/Region`)
global_data %>%
  filter(`Country/Region` %in% top_countries) %>%
  ggplot(aes(x = date, y = deaths, color = `Country/Region`)) +
  geom_line(linewidth = 1) +
  labs(title = "COVID-19 Deaths Over Time for Top 5 Countries",
       x = "Date", y = "Total Deaths") +
  scale_y_continuous(labels = comma) +
  theme_minimal()
# Prepare data: global totals per date and days elapsed since the first date
global_cases <- global_data %>%
  group_by(date) %>%
  summarize(total_confirmed = sum(confirmed)) %>%
  mutate(days_since_start = as.numeric(date - min(date)))
# Model: simple linear regression of cumulative cases on elapsed days
model <- lm(total_confirmed ~ days_since_start, data = global_cases)
summary(model)
##
## Call:
## lm(formula = total_confirmed ~ days_since_start, data = global_cases)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -72214741 -54580975   7228886  29568039 179781581
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -179781024    5970320  -30.11   <2e-16 ***
## days_since_start     756211       8150   92.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50020000 on 377 degrees of freedom
## Multiple R-squared:  0.958,  Adjusted R-squared:  0.9579
## F-statistic:  8608 on 1 and 377 DF,  p-value: < 2.2e-16
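I also explored continent-level comparisons using a simplified manual mapping: only a few countries per continent are assigned, and every other country falls into "Other" (example below):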
# Add continent manually (simplified): unlisted countries are grouped as "Other"
continent_data <- covid_data %>%
  mutate(continent = case_when(
    `Country/Region` %in% c("US", "Canada", "Mexico") ~ "North America",
    `Country/Region` %in% c("China", "Japan", "India") ~ "Asia",
    # The JHU files use "United Kingdom"; "UK" would never match
    `Country/Region` %in% c("Italy", "Spain", "France", "United Kingdom") ~ "Europe",
    `Country/Region` %in% c("Brazil", "Argentina") ~ "South America",
    TRUE ~ "Other"
  ))
continent_summary <- continent_data %>%
  group_by(continent, date) %>%
  summarize(confirmed = sum(confirmed), deaths = sum(deaths), .groups = "drop")
# Visualization: Cases by Continent
continent_summary %>%
  ggplot(aes(x = date, y = confirmed, color = continent)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "COVID-19 Confirmed Cases by Continent Over Time",
       x = "Date", y = "Confirmed Cases") +
  scale_y_continuous(labels = comma) +
  theme_minimal()
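Because the manual mapping assigns only a handful of countries, it can help to keep the mapping in a lookup table and list which countries it misses. The sketch below is one possible approach, not the method used in this report: `continent_lookup` is a hypothetical table holding only the countries named above (with the JHU spelling "United Kingdom"), and the `toy_cases` values are invented.

```r
library(dplyr)

# Hypothetical lookup table; a real analysis would need one row per country
continent_lookup <- tibble(
  `Country/Region` = c("US", "Canada", "Mexico", "China", "Japan", "India",
                       "Italy", "Spain", "France", "United Kingdom",
                       "Brazil", "Argentina"),
  continent = c(rep("North America", 3), rep("Asia", 3),
                rep("Europe", 4), rep("South America", 2))
)

# Toy country-level rows (values are invented)
toy_cases <- tibble(
  `Country/Region` = c("US", "United Kingdom", "Brazil", "Nigeria"),
  confirmed = c(100, 40, 60, 10)
)

toy_mapped <- toy_cases %>%
  left_join(continent_lookup, by = "Country/Region")

# Countries with no continent assigned, listed rather than silently pooled
unmapped <- toy_mapped %>%
  filter(is.na(continent)) %>%
  pull(`Country/Region`)

# Share of confirmed cases that the mapping actually covers
coverage <- sum(toy_mapped$confirmed[!is.na(toy_mapped$continent)]) /
  sum(toy_mapped$confirmed)
```

Reporting `unmapped` and `coverage` alongside the continent plot makes clear how much of the global total each continent line represents.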
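The linear regression model showed a strong positive relationship between the number of days since the start of the pandemic and total confirmed COVID-19 cases globally. Specifically, the slope estimates roughly 756,000 additional confirmed cases per day (p < 2e-16), and the model explains about 95.8% of the variance (R-squared = 0.958, fitted to 379 dates). The fit is far from exact, however: the intercept of about -180 million cases is not physically meaningful, and the residuals range from about -72 million to +180 million, with a residual standard error of about 50 million cases.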
In short, the model correctly identifies the general rising trend in cases but oversimplifies the complex real-world dynamics of pandemic progression.
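One way to see this oversimplification is to compare a straight-line fit with a more flexible curve on data that rises in waves. The sketch below uses simulated cumulative counts, not the JHU data; the wave parameters, the noise level, and the cubic polynomial are arbitrary assumptions chosen only to illustrate the comparison, and `wave()` and `resid_acf()` are hypothetical helpers.

```r
set.seed(1)

# Simulated cumulative counts: two logistic waves plus noise (illustrative only)
days <- 0:299
wave <- function(t, size, mid, rate) size / (1 + exp(-rate * (t - mid)))
total <- wave(days, 1e6, 80, 0.08) + wave(days, 3e6, 220, 0.06) +
  rnorm(length(days), sd = 2e4)
sim <- data.frame(days_since_start = days, total_confirmed = total)

# Straight-line fit, same form as the model in this report
fit_linear <- lm(total_confirmed ~ days_since_start, data = sim)

# A more flexible alternative (cubic polynomial), used here only for contrast
fit_cubic <- lm(total_confirmed ~ poly(days_since_start, 3), data = sim)

# Lag-1 autocorrelation of residuals: values near 1 mean the errors come in
# long runs of the same sign, i.e. the fit misses the shape of the curve
resid_acf <- function(fit) {
  r <- residuals(fit)
  cor(r[-1], r[-length(r)])
}

comparison <- data.frame(
  model = c("linear", "cubic"),
  r_squared = c(summary(fit_linear)$r.squared, summary(fit_cubic)$r.squared),
  aic = c(AIC(fit_linear), AIC(fit_cubic)),
  resid_lag1_acf = c(resid_acf(fit_linear), resid_acf(fit_cubic))
)
print(comparison)
```

A high R-squared together with strongly autocorrelated residuals is the typical signature of a trend that is captured in direction but not in shape.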
This analysis was conducted using RMarkdown, ensuring a transparent,
fully reproducible workflow from data import to final outputs. All code,
data transformations, and visualizations are embedded directly within
the document.
Key steps to guarantee reproducibility:
By structuring the project this way, the results remain verifiable, auditable, and easily adaptable for future updates or peer review. This approach reflects best practices for transparent, reproducible, and ethical data science research.
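Because the import step reads from a live URL, one way to support this reproducibility is to save a dated snapshot of the input with a checksum and to record the package versions. The sketch below is an assumption about how this could be done, not a description of the original workflow; the `snapshot` data frame stands in for the downloaded table, and the file is written to a temporary directory.

```r
# Stand-in for the downloaded table (values are invented)
snapshot <- data.frame(
  country = c("X", "Y"),
  date = as.Date(c("2020-01-22", "2020-01-22")),
  confirmed = c(1, 2),
  stringsAsFactors = FALSE
)

# Write a dated snapshot so later runs can use identical input
snapshot_path <- file.path(tempdir(), paste0("confirmed_", Sys.Date(), ".csv"))
write.csv(snapshot, snapshot_path, row.names = FALSE)

# Record a checksum; a changed file would produce a different value
checksum <- unname(tools::md5sum(snapshot_path))

# Re-read the snapshot and confirm it matches what was written
reloaded <- read.csv(snapshot_path,
                     colClasses = c("character", "Date", "numeric"))
stopifnot(isTRUE(all.equal(snapshot, reloaded)))

# Record the R version and loaded packages alongside the results
session_details <- sessionInfo()
```

Storing the checksum and session details with the rendered report lets a later reader confirm that a rerun used the same input and a comparable environment.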
The COVID-19 pandemic followed a steep global trajectory, marked by sharp early surges and persistent waves across different regions. Among the countries included in the simplified continent mapping, North America and Europe recorded some of the highest case and death counts, while Asia showed relatively faster containment after initial outbreaks.
The United States emerged as the country with the highest cumulative deaths, which points to substantial differences in how the pandemic unfolded and was managed across nations. Case growth also tended to accelerate during winter months, which suggests that seasonal patterns and social behavior influenced the spread of the virus.
While the Johns Hopkins dataset provides a powerful lens into pandemic dynamics, it is crucial to interpret findings within the context of reporting inconsistencies, undercounting, and structural biases. In sum, this analysis reaffirms the importance of timely interventions, transparent data reporting, and cross-country collaboration in managing global health crises.